Among all cancer types, lung cancer is the most lethal among both males and females. It is responsible for 28% of cancer-related mortality. The life expectation for those who have lung cancer are very poor. It is usually diagnosed in an advanced phase, and the patients only have a 16% survival rate after 5 years of their diagnosis.
The Cancer Genome Atlas (TCGA) has comprehensively profiled this type of cancer in a patient cohort. Here, we analyze the expression profiles of those patients, using a pipeline based on the R/Bioconductor software package Rsubread.
This document is written in R markdown and
should be processed using R and you need to install the packages
knitr and markdown. Moreover, it using the official style for Bioconductor vignettes facilitated by the Bioconductor package BiocStyle. It is compiled from different files following the instructions included in the makefile.
Lung cancer, specifically lung squamous cell carcinoma, was assessed and analyzed in this experiment. We start importing the raw table of counts.
library(SummarizedExperiment)
se.unpaired <- readRDS(file.path("rawCounts", "seLUSC.rds"))
se.unpaired
class: RangedSummarizedExperiment
dim: 20115 553
metadata(5): experimentData annotation cancerTypeCode
cancerTypeDescription objectCreationDate
assays(1): counts
rownames(20115): 1 2 ... 102724473 103091865
rowData names(3): symbol txlen txgc
colnames(553): TCGA.18.3406.01A.01R.0980.07
TCGA.18.3407.01A.01R.0980.07 ... TCGA.90.7767.11A.01R.2125.07
TCGA.92.7340.11A.01R.2045.07
colData names(549): type bcr_patient_uuid ...
lymph_nodes_aortic_pos_by_ihc lymph_nodes_aortic_pos_total
table(se.unpaired$type)
normal tumor
51 502
From this, we see that 20115 genes from a total of 553 samples are in this data. We can also observe that this dataset contains 502 tumor samples and 51 normal samples, which matches the information of the dataset from The Cancer Genome Atlas (TCGA) program.
Next, we explored the column (phenotypic) data, which in this case corresponds to clinical variables, and their corresponding metadata. For example, the column “type” consists of tumor and normal labels. Other information such as gender, tumor stage, and smoking status are also shown.
In any case, we can access all the possible labels with the last line of code, which returns two columns of information about the clinical variables. One called labelDescription contains a succint description of the variable, often not more self-explanatory than the variable name itself, and the other called ‘CDEID’ corresponds to the so-called Common Data Element (CDE) identifier. This identifier can be use in https://cdebrowser.nci.nih.gov to search for further information about the associated clinical variable using the Advanced search form and the Public ID attribute search.
dim(colData(se.unpaired))
[1] 553 549
colData(se.unpaired)[1:5, 1:5]
DataFrame with 5 rows and 5 columns
type bcr_patient_uuid
<factor> <factor>
TCGA.18.3406.01A.01R.0980.07 tumor 95b83006-02c9-4c4d-bf84-a45115f7d86d
TCGA.18.3407.01A.01R.0980.07 tumor 4e1ad82e-23c8-44bb-b74e-a3d0b1126b96
TCGA.18.3408.01A.01R.0980.07 tumor d4bc755a-2585-4529-ae36-7e1d88bdecfe
TCGA.18.3409.01A.01R.0980.07 tumor b09e872a-e837-49ec-8a27-84dcdcabf347
TCGA.18.3410.01A.01R.0980.07 tumor 99599b60-4f5c-456b-8755-371b1aa7074e
bcr_patient_barcode form_completion_date
<factor> <factor>
TCGA.18.3406.01A.01R.0980.07 TCGA-18-3406 2011-3-9
TCGA.18.3407.01A.01R.0980.07 TCGA-18-3407 2011-3-9
TCGA.18.3408.01A.01R.0980.07 TCGA-18-3408 2011-3-9
TCGA.18.3409.01A.01R.0980.07 TCGA-18-3409 2011-3-17
TCGA.18.3410.01A.01R.0980.07 TCGA-18-3410 2011-4-4
prospective_collection
<factor>
TCGA.18.3406.01A.01R.0980.07 NO
TCGA.18.3407.01A.01R.0980.07 NO
TCGA.18.3408.01A.01R.0980.07 NO
TCGA.18.3409.01A.01R.0980.07 NO
TCGA.18.3410.01A.01R.0980.07 NO
mcols(colData(se.unpaired), use.names=TRUE)
DataFrame with 549 rows and 2 columns
labelDescription
<character>
type sample type (tumor/normal)
bcr_patient_uuid bcr patient uuid
bcr_patient_barcode bcr patient barcode
form_completion_date form completion date
prospective_collection tissue prospective collection indicator
... ...
lymph_nodes_pelvic_pos_total total pelv lnp
lymph_nodes_aortic_examined_count total aor lnr
lymph_nodes_aortic_pos_by_he aln pos light micro
lymph_nodes_aortic_pos_by_ihc aln pos ihc
lymph_nodes_aortic_pos_total total aor-lnp
CDEID
<character>
type NA
bcr_patient_uuid NA
bcr_patient_barcode 2673794
form_completion_date NA
prospective_collection 3088492
... ...
lymph_nodes_pelvic_pos_total 3151828
lymph_nodes_aortic_examined_count 3104460
lymph_nodes_aortic_pos_by_he 3151832
lymph_nodes_aortic_pos_by_ihc 3151831
lymph_nodes_aortic_pos_total 3151827
Now, we explore the row (feature) data, which provide information on genes. For the first line of command, we can see the length of each gene and the GC content. The second line of command provides further information, such as the ranges within individual chromosome.
rowData(se.unpaired)
DataFrame with 20115 rows and 3 columns
symbol txlen txgc
<character> <integer> <numeric>
1 A1BG 3322 0.564419024683925
2 A2M 4844 0.488232865400495
9 NAT1 2280 0.394298245614035
10 NAT2 1322 0.389561270801815
12 SERPINA3 3067 0.524942940984676
... ... ... ...
100996331 POTEB 1706 0.430832356389215
101340251 SNORD124 104 0.490384615384615
101340252 SNORD121B 81 0.407407407407407
102724473 GAGE10 538 0.505576208178439
103091865 BRWD1-IT2 1028 0.592412451361868
rowRanges(se.unpaired)
GRanges object with 20115 ranges and 3 metadata columns:
seqnames ranges strand | symbol txlen
<Rle> <IRanges> <Rle> | <character> <integer>
1 chr19 58345178-58362751 - | A1BG 3322
2 chr12 9067664-9116229 - | A2M 4844
9 chr8 18170477-18223689 + | NAT1 2280
10 chr8 18391245-18401218 + | NAT2 1322
12 chr14 94592058-94624646 + | SERPINA3 3067
... ... ... ... . ... ...
100996331 chr15 20835372-21877298 - | POTEB 1706
101340251 chr17 40027542-40027645 - | SNORD124 104
101340252 chr9 33934296-33934376 - | SNORD121B 81
102724473 chrX 49303669-49319844 + | GAGE10 538
103091865 chr21 39313935-39314962 + | BRWD1-IT2 1028
txgc
<numeric>
1 0.564419024683925
2 0.488232865400495
9 0.394298245614035
10 0.389561270801815
12 0.524942940984676
... ...
100996331 0.430832356389215
101340251 0.490384615384615
101340252 0.407407407407407
102724473 0.505576208178439
103091865 0.592412451361868
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
In this analysis, we wanted to compare tumor and normal cells. In order to do this, we wanted to compare tumor and normal cells that come from the same patient. This also allows to pair the samples which reduces the possibility of false positives. In order to get them, we execute the following code.
# get the number of occurences of each patient
occur <- data.frame(table(substr(colnames(se.unpaired), 9, 12)))
# get those that occur twice
paired_df <- occur[occur$Freq > 1,]
paired <- as.vector(paired_df$Var1)
paired <- paired[2:length(paired)]
mask <- is.element(substr(colnames(se.unpaired), 9, 12), paired)
se <- se.unpaired[ ,mask]
saveRDS(se, file.path("results", "se.paired.rds"))
Since this data is unprocessed, we must first perform a quality assessment and normalize accordingly. To do this, we need first to load the edgeR R/Bioconductor package. This requires the creation of a `DGEList’ object, as the package doesn’t work directly with SummarizedExperiment objects. In any case, we will update any change in the DGElist object to the SummarizedExperiment object.
library(edgeR)
dge <- DGEList(counts=assays(se)$counts, genes=mcols(se))
saveRDS(dge, file.path("results", "dge.paired.rds"))
One way of normalizing is to use the counts per million (CPM) values. This value is calculated by dividing the number of counts of each sample by 1 million. Instead of using this directly, we calculate \(\log_2\) CPM values of expression because it has nicer distributional properties than raw counts or non-logged CPM units. We set this information as an additional assay element to ease their manipulation.
assays(se)$logCPM <- cpm(dge, log=TRUE, prior.count=0.5)
assays(se)$logCPM[1:5, 1:5]
TCGA.22.4593.01A.21R.1820.07 TCGA.22.4609.01A.21R.2125.07
1 2.433139 1.760377
2 9.013463 10.810625
9 -6.817220 -6.817220
10 -6.817220 -6.817220
12 5.605909 5.647356
TCGA.22.5471.01A.01R.1635.07 TCGA.22.5472.01A.01R.1635.07
1 2.122666 0.775287
2 7.713750 11.100735
9 -6.817220 -6.817220
10 -6.817220 -6.817220
12 4.823947 5.856939
TCGA.22.5478.01A.01R.1635.07
1 3.713844
2 9.448174
9 -6.817220
10 -6.817220
12 4.920820
Let’s examine the sequencing depth in terms of total number of sequence read counts mapped to the genome per sample. Figure 1 below shows the sequencing depth per sample, also known as library sizes, in increasing order.
Figure 1: Library sizes in increasing order
This figure reveals no big changes in sequencing depth between samples. In addition, we see a uniform distribution of tumor and normal samples across the figure. We see a cluster of normal samples on the right side with a higher cpm, but don’t believe this will affect our analyses. We can obtain the samples with less sequencing depth using the commands below.
sampledepth <- round(dge$sample$lib.size / 1e6, digits=1)
names(sampledepth) <- substr(colnames(se), 6, 12)
sort(sampledepth)
43.6773 56.7823 77.7335 56.8309 56.8083 58.8386 77.7335 43.6773 56.7582
28.8 29.0 31.0 32.0 32.9 33.4 34.9 35.0 35.9
56.8201 56.8201 56.8623 34.8454 77.7142 56.7579 56.7730 56.8083 56.8082
36.5 37.0 37.0 37.6 38.7 39.7 40.2 40.3 40.4
56.8082 51.4081 34.8454 22.5481 56.8309 77.7337 33.4587 56.7823 56.7580
40.7 41.9 42.1 42.3 43.6 44.0 45.6 45.7 45.8
56.7579 85.7710 58.8386 34.7107 56.7731 22.5471 77.7142 56.7580 56.7731
45.9 46.5 46.6 47.3 47.6 47.8 48.4 49.7 50.0
85.7710 43.3394 39.5040 22.5472 60.2709 51.4079 90.6837 51.4080 77.7338
50.1 50.2 51.0 51.1 51.2 51.5 51.6 51.9 52.0
56.7222 77.7338 77.7138 56.7582 22.5471 43.5670 22.4609 22.5483 92.7340
52.4 52.4 52.8 52.9 53.6 53.6 54.2 54.4 54.6
77.8008 34.7107 90.7767 43.7657 22.4609 43.7657 77.7138 33.6737 43.7658
54.9 55.1 55.7 55.8 56.3 56.7 57.0 57.2 57.4
77.7337 77.8007 43.5670 56.7222 56.8623 33.4587 43.7658 43.6143 22.5481
57.6 57.6 58.0 58.0 59.1 59.6 59.9 60.0 60.3
22.5491 43.6647 92.7340 39.5040 22.5489 22.5482 77.8008 22.5489 22.5491
61.0 61.3 61.3 61.5 63.5 64.5 65.2 65.5 66.0
77.8007 22.5478 90.6837 90.7767 56.7730 33.6737 60.2709 22.5478 43.6143
66.4 68.0 68.2 68.4 68.8 69.3 69.8 71.0 75.9
22.5472 22.5482 22.4593 43.6771 22.5483 43.6647 22.4593 43.6771 51.4081
76.2 77.0 77.9 79.4 80.3 81.5 85.0 88.8 103.7
51.4080 51.4079 43.3394
119.5 122.6 124.1
Next, we will look at the distribution of expression values per sample in terms of logarithmic CPM units. We display tumor and normal samples separately, and are shown in Figure 2. Each colored line in the graphs below represent a sample. In both graphs, we see two modes. The low mode represents genes not expressed in that sample. The second mode represents the genes that are expressed.
Figure 2: Non-parametric density distribution of expression profiles per sample
In both graphs, we see some genes that deviate a little from the rest. These could be potential outliers and we have noted them for later analysis.
Next, we wanted to look at the distribution of expression levels across genes. To do this, we calculate the average expression per gene through all the samples. Figure 3 shows the distribution of those values.
Figure 3: Distribution of average expression level per gene
RNA sequence expression profiles that come from lowly-expressed genes can lead to artifacts in downstream differential expression analyses. Thus, it is common to set a threshold and filter out genes that fall below this value. In the light of this plot above, we may consider a cutoff of 1 log CPM unit as minimum value of expression to select genes being expressed across samples. Using this cutoff we proceed to filter out lowly-expressed genes. Now, we have a total of 11872 genes and 102 samples.
mask <- avgexp > 1
dim(se)
[1] 20115 102
se.filt <- se[mask, ]
dim(se.filt)
[1] 11872 102
dge.filt <- dge[mask, ]
dim(dge.filt)
[1] 11872 102
We then store the un-normalized versions of the filtered expression data.
saveRDS(se.filt, file.path("results", "se.paired.filt.unnorm.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.unnorm.rds"))
Next, we calculate the normalization factors on the filtered expression data set using the calcNormFactors function.
dge.filt <- calcNormFactors(dge.filt)
Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.
assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE, normalized.lib.sizes=TRUE, prior.count=0.25)
Store normalized versions of the filtered expression data.
saveRDS(se.filt, file.path("results", "se.paired.filt.rds"))
saveRDS(dge.filt, file.path("results", "dge.paired.filt.rds"))
We examine now the MA-plots of the normalized expression profiles. These plots allow us to visualize how one sample compares to the average of the rest of the samples. We look first to the tumor samples in Figure 4.
Figure 4: MA-plots of the tumor samples
In the MA-plots of the tumor samples, we observed the following:
We observe that the appearance of samples that differ from the mean can be problematic. If during downstream analysis we observe unexpected results, those samples should be removed together with their normal pairs. We can obtain their names with the following code:
big_c
[1] "TCGA.56.7823" "TCGA.56.8623" "TCGA.77.7138"
Next, we look now to the normal samples in Figure 5.
Figure 5: MA-plots of the normal samples
In this case, we observed:
In conclusion, we do not observe either important expression-level dependent biases among the normal samples.
Batch effects are sub-groups of measurements that have a qualitatively different behavior across conditions and are unrelated to the variables in the study. It is important to determine if batch effects are present to know if any confounding variables are affecting the data. We will search now for potential surrogate of batch effect indicators within normal and tumor samples. Given that each sample names corresponds to a TCGA barcode (see https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode), following the strategy described in http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overview we are going to derive different elements of the TCGA barcode and examine their distribution across samples.
tss <- substr(colnames(se.filt), 6, 7)
table(data.frame(TYPE=se.filt$type, TSS=tss))
TSS
TYPE 22 33 34 39 43 51 56 58 60 77 85 90 92
normal 10 2 2 1 8 3 12 1 1 7 1 2 1
tumor 10 2 2 1 8 3 12 1 1 7 1 2 1
center <- substr(colnames(se.filt), 27, 28)
table(data.frame(TYPE=se.filt$type, CENTER=center))
CENTER
TYPE 07
normal 51
tumor 51
plate <- substr(colnames(se.filt), 22, 25)
table(data.frame(TYPE=se.filt$type, PLATE=plate))
PLATE
TYPE 0980 1100 1635 1758 1820 1858 1949 2045 2125 2187 2247 2296 2326
normal 0 0 5 4 7 1 4 10 10 2 4 2 1
tumor 1 3 6 0 7 0 4 10 10 2 4 2 1
PLATE
TYPE A28V
normal 1
tumor 1
portionanalyte <- substr(colnames(se.filt), 18, 20)
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
PORTIONANALYTE
TYPE 01R 11R 21R 31R 41R
normal 48 3 0 0 0
tumor 11 28 8 2 2
samplevial <- substr(colnames(se.filt), 14, 16)
table(data.frame(TYPE=se.filt$type, SAMPLEVIAL=samplevial))
SAMPLEVIAL
TYPE 01A 01B 11A
normal 0 0 51
tumor 50 1 0
From this information we can make the following observations:
colnames(se.filt)[samplevial == "01B"]
[1] "TCGA.56.7823.01B.11R.2247.07"
With those results, we use the portion of analyte as surrogate of batch effect indicator. Considering our outcome of interest as molecular changes between sample types, tumor vs. normal, we will examine now the cross-classification of this outcome with portion of analyte.
table(data.frame(TYPE=se.filt$type, PORTIONANALYTE=portionanalyte))
PORTIONANALYTE
TYPE 01R 11R 21R 31R 41R
normal 48 3 0 0 0
tumor 11 28 8 2 2
As we can see here, the majority of normal samples (48) come from the “01R”" analyte and 3 come from the “11R” analyte. However, the tumor samples span across five different analytes. This could potentially lead to a batch effect.
In order to examine how the samples group together, we use two approaches: hierarchical clustering and multidimensional scaling, annotating the outcome of interest and the surrogate of batch indicator, or portion of analyte in our case.
We perform this under a subset of our samples, as we wish to have a matrix without zeros. We remove samples belonging to portions 21R, 31R and 41R; and as the experiment is paired, also their pairs.
por1 <- as.vector(substr(colnames(se.filt)[portionanalyte == "41R"], 9, 12))
por2 <- as.vector(substr(colnames(se.filt)[portionanalyte == "31R"], 9, 12))
por3 <- as.vector(substr(colnames(se.filt)[portionanalyte == "21R"], 9, 12))
allpor <- c(por1, por2, por3)
table(is.element(substr(colnames(se.filt), 9, 12), allpor))
FALSE TRUE
78 24
se.batch <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), allpor)]
dge.batch <- DGEList(counts=assays(se.batch)$counts, genes=mcols(se.batch))
new.portionanalyte <- substr(colnames(se.batch), 18, 20)
table(data.frame(TYPE=se.batch$type, PORTIONANALYTE=new.portionanalyte))
PORTIONANALYTE
TYPE 01R 11R
normal 36 3
tumor 11 28
dim(dge.batch)
[1] 11872 78
We see that we are able to remove 24 samples, which correspond to the 12 in the portions we want to remove and their pairs.
We calculate again log CPM values with a higher prior count to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 6.
logCPM <- cpm(dge.batch, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(new.portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.batch)
outcome <- paste(substr(colnames(se.batch), 9, 12), as.character(se.batch$type), sep="-")
names(outcome) <- colnames(se.batch)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Figure 6: Hierarchical clustering of the samples
We can observe that samples cluster primarily by sample type, tumor or normal. We observe that the different portions are present in both clusters, so we don’t consider that it is inducing batch effect.
We also see that one of the tumor samples, 8623-tumor, is present in the normal samples cluster. This one also had a strong deviation in the MA plot, so we might consider discarding it.
In Figure 7 we show the corresponding MDS plot. Here we see more clearly that the first source of variation separates tumor from normal samples, and we don’t see any sample that is separated from any cluster.
plotMDS(dge.batch, labels=outcome, col=batch)
legend("bottomleft", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))),
fill=sort(unique(batch)), inset=0.05)
Figure 7: Multidimensional scaling plot of the samples
Finally, under the consideration that there is no batch effect, the 24 samples that were removed in order to make the dendogram and the MDS plot will remain for further analysis. We decide to generate the dendogram in Figure 8 again in order to see if any other sample appears in its opposite cluster, apart from 8623-tumor.
logCPM <- cpm(dge.filt, log=TRUE, prior.count=3)
d <- as.dist(1-cor(logCPM, method="spearman"))
sampleClustering <- hclust(d)
batch <- as.integer(factor(portionanalyte))
sampleDendrogram <- as.dendrogram(sampleClustering, hang=0.1)
names(batch) <- colnames(se.filt)
outcome <- paste(substr(colnames(se.filt), 9, 12), as.character(se.filt$type), sep="-")
names(outcome) <- colnames(se.filt)
sampleDendrogram <- dendrapply(sampleDendrogram,
function(x, batch, labels) {
if (is.leaf(x)) {
attr(x, "nodePar") <- list(lab.col=as.vector(batch[attr(x, "label")]))
attr(x, "label") <- as.vector(labels[attr(x, "label")])
}
x
}, batch, outcome)
plot(sampleDendrogram, main="Hierarchical clustering of samples")
legend("topright", paste("Batch", sort(unique(batch)), levels(factor(new.portionanalyte))), fill=sort(unique(batch)))
Figure 8: Hierarchical clustering of the samples
We observe that only the 8623-tumor sample doesn’t cluster with the rest of the tumor samples.
Sample removal:
dim(se.filt)
[1] 11872 102
dim(dge.filt)
[1] 11872 102
number <- as.character(8623)
se.filt <- se.filt[ ,!is.element(substr(colnames(se.filt), 9, 12), number)]
dge.filt <- DGEList(counts=assays(se.filt)$counts, genes=mcols(se.filt))
dim(se.filt)
[1] 11872 100
dim(dge.filt)
[1] 11872 100
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] geneplotter_1.60.0 annotate_1.60.1
[3] XML_3.98-1.20 AnnotationDbi_1.44.0
[5] lattice_0.20-38 edgeR_3.24.3
[7] limma_3.38.3 SummarizedExperiment_1.12.0
[9] DelayedArray_0.8.0 BiocParallel_1.16.6
[11] matrixStats_0.54.0 Biobase_2.42.0
[13] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[15] IRanges_2.16.0 S4Vectors_0.20.1
[17] BiocGenerics_0.28.0 knitr_1.23
[19] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 RColorBrewer_1.1-2 compiler_3.5.3
[4] BiocManager_1.30.4 highr_0.8 XVector_0.22.0
[7] bitops_1.0-6 tools_3.5.3 zlibbioc_1.28.0
[10] bit_1.1-14 digest_0.6.19 memoise_1.1.0
[13] RSQLite_2.1.1 evaluate_0.14 Matrix_1.2-17
[16] DBI_1.0.0 yaml_2.2.0 xfun_0.7
[19] GenomeInfoDbData_1.2.0 stringr_1.4.0 bit64_0.9-7
[22] locfit_1.5-9.1 grid_3.5.3 rmarkdown_1.13
[25] bookdown_0.11 blob_1.1.1 magrittr_1.5
[28] codetools_0.2-16 htmltools_0.3.6 xtable_1.8-4
[31] KernSmooth_2.23-15 stringi_1.4.3 RCurl_1.95-4.12
Differential expression analysis consists of identifying changes in gene expression. For RNAseq experiments, this means comparing relative concentrations of RNA molecules. In R, this can be done by using the limma and sva R/Bioconductor packages.
limma allows for linear model analysis of data for DE. This requires building two model matrices: a matrix of the model including the outcome of interest and adjustment variables, and a matrix of the null model that only includes adjustment variables. The outcome of interest is the type of sample (tumor vs normal), and as both types of samples come from the same patient the analysis is paired, so a unique identifier for each patient is included. Finally, as there are unknown factors to adjust, an intercept term of 1 is given to the null model.
Even if no known sources of variation were found, surrogate variables representing unknown sources of variation can be added to the model. Those can be estimated with the sva package, and later be appended to the model. Finally, voom is used to fit the model as the library size distribution is not uniform across samples, and the moderated t-statistics are calculated.
library(sva)
# Obtain the patient ID, as the analysis is paired
patientid <- substr(colnames(se.filt), 9, 12)
samplevial <- substr(colnames(se.filt), 14, 16)
# Build the model (mod) and null (mod0) matrices
mod <- model.matrix(~type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~patientid, data = colData(se.filt))
# Estimate surrogate variables
sv <- sva(assays(se.filt)$logCPM, mod = mod, mod0 = mod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5
# Add surrogates to the design matrix
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod, plot=TRUE)
Figure 9: Mean-variance plot for DE analysis
fit <- lmFit(v, mod) # Fit the model with the voom weights
fit <- eBayes(fit) # Calculate moderated t-statistics
sva was able to estimate 11 surrogate variables. Regarding the mean-variance plot shown in Figure ??, which was obtained with voom, it shows the variance in the y axis and the mean of expression in the x axis for each gene, represented as a black dot. Such plots contain a red curve that is fitted to the dots and show trends in the data. Normally increases in expression levels mean a decrease in the variance, until a plateau is reached for highly expressed genes. This trend is observed in the plot obtained for this data. Moreover, the plot is also a diagnosis of filtering steps performed upstream. The fact that there is not a drop in variance for the lower expression values indicates that the filtering threshold was good.
As we are performing multiple tests, the probability of type I errors (rejecting the null hypotheses when it is true) increases. There are multiple statistical strategies to decrease the likelihood of these errors, such as the false discovery rate (FDR). This technique consists of defining an acceptable rate of type I errors (false discoveries). Here, a FDR of 0.01 is set, which means that 1% of the rejections of the null hypothesis will be incorrect. For our data set, which has 11872 observations, up to 119 genes can be false positives (FP).
FDRcutoff <- 0.01
res <- decideTests(fit, p.value = FDRcutoff)
summary(res)
(Intercept) typetumor patientid3394 patientid4079 patientid4080
Down 4 4089 0 16 0
NotSig 768 2682 11870 11853 11871
Up 11100 5101 2 3 1
patientid4081 patientid4587 patientid4593 patientid4609 patientid5040
Down 1 4 1 0 0
NotSig 11870 11867 11869 11871 11870
Up 1 1 2 1 2
patientid5471 patientid5472 patientid5478 patientid5481 patientid5482
Down 0 0 0 5 1
NotSig 11871 11870 11871 11866 11870
Up 1 2 1 1 1
patientid5483 patientid5489 patientid5491 patientid5670 patientid6143
Down 0 1 0 0 1
NotSig 11871 11869 11870 11871 11869
Up 1 2 2 1 2
patientid6647 patientid6737 patientid6771 patientid6773 patientid6837
Down 4 1 1 1 0
NotSig 11867 11867 11869 11870 11871
Up 1 4 2 1 1
patientid7107 patientid7138 patientid7142 patientid7222 patientid7335
Down 1 1 5 0 4
NotSig 11870 11871 11865 11871 11866
Up 1 0 2 1 2
patientid7337 patientid7338 patientid7340 patientid7579 patientid7580
Down 0 1 6 0 1
NotSig 11870 11870 11862 11870 11870
Up 2 1 4 2 1
patientid7582 patientid7657 patientid7658 patientid7710 patientid7730
Down 1 4 4 10 1
NotSig 11870 11866 11868 11860 11869
Up 1 2 0 2 2
patientid7731 patientid7767 patientid7823 patientid8007 patientid8008
Down 5 0 5 5 9
NotSig 11864 11870 11866 11866 11862
Up 3 2 1 1 1
patientid8082 patientid8083 patientid8201 patientid8309 patientid8386
Down 6 0 0 0 0
NotSig 11865 11870 11871 11872 11871
Up 1 2 1 0 1
patientid8454 SV1 SV2 SV3 SV4 SV5 SV6 SV7 SV8 SV9
Down 7 1277 317 603 593 62 34 27 13 202
NotSig 11864 8901 9085 10839 10614 11547 11592 11811 11849 11565
Up 1 1694 2470 430 665 263 246 34 10 105
SV10 SV11
Down 65 0
NotSig 11689 11871
Up 118 1
The summary table displays the upregulated, not significant, and downregulated genes for each variable in the model.
Gene metadata such as the gene symbol and the chromosome can be added for an easier interpretation of the obtained statistics, and to observe the chromosome distribution of the differentially expressed genes.
# Get gene info...
genesmd <- data.frame(chr = as.character(seqnames(rowRanges(se.filt))), symbol = rowData(se.filt)[, 1], stringsAsFactors = FALSE)
# and add it to the results
fit$genes <- genesmd
tt <- topTable(fit, coef = 2, n = Inf)
head(tt, n = 20)
chr symbol logFC AveExpr t P.Value
171482 chrX FAM9A 6.766405 1.477480 36.85864 6.843818e-34
2261 chr4 FGFR3 4.348392 4.461243 34.53058 1.003148e-32
51569 chr13 UFM1 5.802336 3.035598 33.92022 2.085602e-32
9263 chr7 STK17A 5.509825 7.124666 32.45828 1.266931e-31
51804 chr14 SIX4 3.296163 4.291469 32.44698 1.285100e-31
23217 chr19 ZFR2 5.120053 6.564764 32.37119 1.414022e-31
54536 chr10 EXOC6 4.766545 2.127249 31.64391 3.576728e-31
100616453 chr11 MIR4687 3.527821 5.839996 31.46343 4.516773e-31
692205 chr2 SNORD89 4.647090 4.105932 31.10766 7.180600e-31
27316 chrX RBMX 4.427552 3.627725 31.06818 7.561981e-31
6406 chr20 SEMG1 2.555202 4.942304 30.96833 8.621594e-31
2079 chr14 ERH 4.226847 1.320862 31.05712 7.672451e-31
8317 chr1 CDC7 4.744208 2.756026 30.62319 1.360661e-30
117156 chr5 SCGB3A2 4.486026 6.802293 30.38548 1.868232e-30
2149 chr5 F2R 3.206808 3.639705 30.25463 2.226567e-30
158062 chr9 LCN6 8.017857 6.999330 30.16581 2.509203e-30
10643 chr7 IGF2BP3 -4.052705 4.356854 -29.91757 3.510205e-30
11252 chr22 PACSIN2 4.913960 1.672498 29.82982 3.954882e-30
121391 chr12 KRT74 3.939914 3.777500 29.70032 4.718864e-30
5782 chr7 PTPN12 3.323264 8.759576 29.68945 4.789500e-30
adj.P.Val B
171482 8.124980e-30 66.95030
2261 5.954687e-29 64.56615
51569 8.253424e-29 63.71946
9263 2.797878e-28 62.09967
51804 2.797878e-28 62.06905
23217 2.797878e-28 61.99049
54536 6.066132e-28 60.90182
100616453 6.702891e-28 60.82636
692205 8.280668e-28 60.32293
27316 8.280668e-28 60.25544
6406 8.529630e-28 60.18386
2079 8.280668e-28 60.11122
8317 1.242597e-27 59.62395
117156 1.584261e-27 59.40209
2149 1.762254e-27 59.20703
158062 1.861829e-27 59.11295
10643 2.451362e-27 58.77632
11252 2.608465e-27 58.54528
121391 2.843047e-27 58.46417
5782 2.843047e-27 58.43767
This table contains the relevant statistics used in the analysis, such as raw and adjusted p-values. In the case of the later ones, it can be seen some of their values are pretty low, which means that the result is not significant. The number of differentially expressed genes under the 0.01 FDR cuttoff can be checked by applying a filter to this table.
signDEgenes <- rownames(tt)[tt$adj.P.Val < FDRcutoff]
length(signDEgenes)
[1] 9190
The differential expression analysis results in 9190 significantly differentially expressed genes. Those can still be filtered by a log fold change cutoff of 2, with the following chromosome distribution.
tt.sign <- tt[tt$adj.P.Val < FDRcutoff,]
DEgenes <- rownames(tt.sign)[abs(tt.sign$logFC) > 2]
length(DEgenes)
[1] 1455
saveRDS(DEgenes, file.path("results", "DEgenes.rds"))
So we are finally calling differentially expressed to 1455 genes. The chromosome distribution of those genes can be obtained.
tt$chr <- sub( "^chr([X|Y|0-9]+.*)", "\\1", tt$chr, perl = T)
tt$chr <- sub( "(.*)_.*_.*$", "\\1A", tt$chr, perl = T)
#sort(table(tt$chr[tt$adj.P.Val < FDRcutoff]), decreasing = TRUE)
sort(table(tt.sign$chr[abs(tt.sign$logFC) > 2]), decreasing = TRUE)
chr1 chr19 chr11
154 101 99
chr2 chr17 chr12
90 86 85
chr9 chr16 chr5
72 69 65
chr4 chr3 chr10
64 63 61
chr6 chr7 chrX
59 59 59
chr20 chr15 chr14
46 43 42
chr8 chr22 chr13
40 31 27
chr21 chr18 chr1_KI270766v1_alt
24 15 1
#plot(sort(table(tt$chr[tt$adj.P.Val < FDRcutoff & !grepl(".*[A|v][1]*$", tt$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)
plot(sort(table(tt.sign$chr[abs(tt.sign$logFC) > 2 & !grepl(".*[A|v][1]*$", tt$chr, perl=TRUE)]), decreasing = TRUE), main = "DE genes chromosome distribution", ylab = "Number of DE genes", xlab = "Chromosome", cex.axis = 0.7)
Warning in abs(tt.sign$logFC) > 2 & !grepl(".*[A|v][1]*$", tt$chr, perl =
TRUE): longer object length is not a multiple of shorter object length
Figure 10: Chromosome distribution of the differentially expressed genes
The distribution is shown in Figure 10, and it does not correspond to a distribution based on the chromosome size, as some chromosomes are located higher in the ranking than what would be expected by its size. The chromosome Y does not have any differentially expressed genes.
A list of the top 10 most significant differentially expressed genes can be obtained:
top <- order(fit$lods[, 2], decreasing = TRUE)[1:10]
fit$genes$symbol[top]
[1] "FAM9A" "FGFR3" "UFM1" "STK17A" "SIX4" "ZFR2" "EXOC6"
[8] "MIR4687" "SNORD89" "RBMX"
The distribution of raw p-values and the QQ plot moderated t-statistics, displayed in Figure 11, can be used as a diagnosis of the tests performed.
par(mfrow = c(1, 2), mar = c(4, 5, 2, 2))
hist(tt$P.Value, xlab = "Raw P-values", main = "", las = 1)
qqt(fit$t[, 2], df = fit$df.prior + fit$df.residual, main = "", pch = ".", cex = 3)
abline(0, 1, lwd = 2)
Figure 11: Raw p-values distribution and QQ plot of the moderated t-statistics for DE analysis
Under the null hypothesis, which is that most genes are not differentially expressed, the raw p-value distribution should be uniform with a peak at low p-values for the truly DE genes. This coincides with the observed plot.
Regarding the QQ plot, it represents the quantiles of our data sample against the theoretical distribution they should follow. The diagonal represents the null hypothesis. Most genes are off-diagonal, indicating differential expression. The fact tha most of the genes are far from the null hypothesis is a measure of the significance of the results.
The differential expression results are diagnosed with volcano and MA plots, in which the top 10 differentially expressed genes are highlighted as shown in Figure ??.
par(mfrow = c(1, 2), mar = c(4, 5, 3, 2))
volcanoplot(fit, coef = 2, highlight = 10, names = fit$genes$symbol, main = "Volcano Plot")
plot(tt$logFC, -log10(tt$P.Value), xlab="Log2 fold-change", ylab="-log10 P-value",
pch=".", cex=2, col=grey(0.75), cex.axis=1, cex.lab=1, las=1)
posx <- tt[tt$adj.P.Val < 0.01, "logFC"]
posy <- -log10(tt[tt$adj.P.Val < 0.01, "P.Value"])
points(posx, posy, pch=".", cex=2, col="black")
Figure 12: Volcano and MA plots for DE analysis without SVA
limma::plotMA(fit, coef = 2, status = rownames(fit$lods) %in% signDEgenes, legend = FALSE,
main = "MA Plot", hl.pch = 46, hl.cex = 4, bg.pch = 46, bg.cex = 3, las = 1)
text(fit$Amean[top], fit$coef[top, 2], fit$genes$symbol[top], cex = 0.5, pos = 4)
Figure 13: Volcano and MA plots for DE analysis without SVA
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] sva_3.30.1 genefilter_1.64.0
[3] mgcv_1.8-28 nlme_3.1-140
[5] geneplotter_1.60.0 annotate_1.60.1
[7] XML_3.98-1.20 AnnotationDbi_1.44.0
[9] lattice_0.20-38 edgeR_3.24.3
[11] limma_3.38.3 SummarizedExperiment_1.12.0
[13] DelayedArray_0.8.0 BiocParallel_1.16.6
[15] matrixStats_0.54.0 Biobase_2.42.0
[17] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[19] IRanges_2.16.0 S4Vectors_0.20.1
[21] BiocGenerics_0.28.0 knitr_1.23
[23] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] locfit_1.5-9.1 xfun_0.7 splines_3.5.3
[4] htmltools_0.3.6 yaml_2.2.0 blob_1.1.1
[7] survival_2.44-1.1 DBI_1.0.0 bit64_0.9-7
[10] RColorBrewer_1.1-2 GenomeInfoDbData_1.2.0 stringr_1.4.0
[13] zlibbioc_1.28.0 codetools_0.2-16 evaluate_0.14
[16] memoise_1.1.0 highr_0.8 Rcpp_1.0.1
[19] xtable_1.8-4 BiocManager_1.30.4 XVector_0.22.0
[22] bit_1.1-14 digest_0.6.19 stringi_1.4.3
[25] bookdown_0.11 grid_3.5.3 tools_3.5.3
[28] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.12
[31] RSQLite_2.1.1 Matrix_1.2-17 rmarkdown_1.13
[34] compiler_3.5.3
In addition to identifying differential expressed genes by adjusting surrogate variables in the differential expression analysis, a gene set enrichment analysis (GSEA) was also completed. This procedure provides more sensitivity in identifying gene expression changes. This analysis assesses how genes are behaving differently between two phenotypic states. The phenotypic states in this study are tumor and normal samples. The analysis calculates an enrichment score for each gene set. This score provides information on the changes in gene expression by inidividual genes in the gene set.
The first step in running a GSEA was to select a collection of gene sets from the MSigDB gene set collections provided by the Broad Institute. The C2 collection contained 3272 computational gene sets made up of cancer-oriented microarray data.
library(GSEABase)
Loading required package: graph
Attaching package: 'graph'
The following object is masked from 'package:XML':
addNode
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 11872
#c4 <- getGmt("c4.all.v6.2.entrez.gmt")
#length(c4)
library(GSVAdata)
Loading required package: hgu95a.db
Loading required package: org.Hs.eg.db
data(c2BroadSets)
c2BroadSets
GeneSetCollection
names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
length(c2BroadSets)
[1] 3272
After the collection was selected for the GSEA, the next step was to map the identifiers from the 3272 C2 gene sets to the dataset being analyzed.
gsc <- GeneSetCollection(c(c2BroadSets))
#map identifiers
gsc <- mapIdentifiers(gsc, AnnoOrEntrezIdentifier(metadata(se.filt)$annotation))
gsc
GeneSetCollection
names: NAKAMURA_CANCER_MICROENVIRONMENT_UP, NAKAMURA_CANCER_MICROENVIRONMENT_DN, ..., ST_PHOSPHOINOSITIDE_3_KINASE_PATHWAY (3272 total)
unique identifiers: 5167, 100288400, ..., 57191 (29340 total)
types in collection:
geneIdType: EntrezIdentifier (1 total)
collectionType: BroadCollection (1 total)
A matrix was created to check if the identifiers were properly mapped, or matched to the corresponding identifier in the dataset being analyzed.
Im <- incidence(gsc)
dim(Im)
[1] 3272 29340
Im[1:2, 1:10]
5167 100288400 338328 388 10631 440387
NAKAMURA_CANCER_MICROENVIRONMENT_UP 1 1 1 1 1 1
NAKAMURA_CANCER_MICROENVIRONMENT_DN 0 0 0 0 0 0
54910 10136 3630 6662
NAKAMURA_CANCER_MICROENVIRONMENT_UP 1 1 1 1
NAKAMURA_CANCER_MICROENVIRONMENT_DN 0 0 0 0
Im <- Im[, colnames(Im) %in% rownames(se.filt)]
dim(Im)
[1] 3272 9963
Since we are only interested in the genes from the data set being analyzed, all genes that were not a part of this gene set, denoted by a “0”, were discarded.
se.filt <- se.filt[colnames(Im), ]
dge.filt <- dge.filt[colnames(Im), ]
This left 9963 genes to be analyzed among 100 gene sets from the collection. With the data ready, a SVA was completed without calling any gene differentially expressed. The mean-variance was be the same from the previous differential expression analysis so the plot was not displayed.
patientid <- substr(colnames(se.filt), 9, 12)
mod <- model.matrix(~type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~ patientid, data = colData(se.filt))
sv <- sva(assays(se.filt)$logCPM, mod, mod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5
# Add surrogates to the model
len <- length(colnames(mod))
mod <- cbind(mod, sv$sv)
colnames(mod) <- c(colnames(mod)[1:len], paste0("SV", 1:sv$n))
v <- voom(dge.filt, mod)
fit <- lmFit(v, mod)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
With the t-statistics avaialble, the z-scores were then calculated to see if a shift in gene expression within a gene set was present. First, a filter was set that required the gene sets to have a minimum size of 5 genes.
Im <- Im[rowSums(Im) >= 5, ]
dim(Im)
[1] 2760 9963
The t-statistics were then stored in a vector in order of the incidence matrix. The z-score was then calculated and sorted by the absolute score. The z-score of 5 indicated that this gene set was about 5 units above the mean. A low z-score suggested the genes in the gene set were not differentially expressed.
tGSgenes <- tt[match(colnames(Im), rownames(tt)), "t"]
length(tGSgenes)
[1] 9963
head(tGSgenes)
[1] 6.684433 5.595676 2.657648 -13.938881 3.125743 6.416383
zS <- sqrt(rowSums(Im)) * (as.vector(Im %*% tGSgenes)/rowSums(Im))
length(zS)
[1] 2760
head(zS)
NAKAMURA_CANCER_MICROENVIRONMENT_UP NAKAMURA_CANCER_MICROENVIRONMENT_DN
-2.971891 -3.972357
WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN
-3.271506 -8.729491
WINTER_HYPOXIA_UP WINTER_HYPOXIA_DN
4.224154 9.023336
rnkGS <- sort(abs(zS), decreasing = TRUE)
head(rnkGS)
WHITEFORD_PEDIATRIC_CANCER_MARKERS GEORGES_TARGETS_OF_MIR192_AND_MIR215
44.62003 41.81863
KEGG_CELL_CYCLE PUJANA_XPRSS_INT_NETWORK
40.98101 40.04934
DODD_NASOPHARYNGEAL_CARCINOMA_DN REACTOME_CELL_CYCLE_MITOTIC
40.02843 38.76168
A one sample z-test was calculated and a conservative multiple testing adjustment was administered to see if any gene sets were candidates for differentially expressed genes. The reason for completing a multiple testing adjustment was due to gene set overlaps.
pv <- pmin(pnorm(zS), 1 - pnorm(zS))
pvadj <- p.adjust(pv, method = "fdr")
DEgs <- names(pvadj)[which(pvadj < 0.01)]
length(DEgs)
[1] 2339
As shown, 2339 gene sets were differentially expressed.
In addition to the GSEA, a gene set variation analysis (GSVA) was performed. This differs from the standard GSEA by utilizing a gene set by sample matrix rather than a gene by sample matrix. This difference allowed for the pathway enrichment for each individual sample to be analyzed.
In order to do this, a matrix was created with the number of gene sets and an enrichment score that indicates sample-wise gene-level summaries of expression. Since gene sets with a small amount of genes don’t provide much information, a filter was set to remove those gene sets with 5 or less genes.
library(GSVA)
GSexpr <- gsva(assays(se.filt)$logCPM, gsc, min.sz=5, max.sz=300, verbose=FALSE)
dim(GSexpr)
[1] 2698 100
With the data ready, a SVA was completed without calling any gene differentially expressed.
mod <- model.matrix(~se.filt$type + patientid, data = colData(se.filt))
mod0 <- model.matrix(~ patientid, data = colData(se.filt))
svaobj <- sva(GSexpr, mod, mod0)
Number of significant surrogate variables is: 11
Iteration (out of 5 ):1 2 3 4 5
modSVs <- cbind(mod, svaobj$sv)
corfit <- duplicateCorrelation(GSexpr, modSVs, block = se.filt$cellline)
fit <- lmFit(GSexpr, modSVs)
fit <- eBayes(fit)
tt <- topTable(fit, coef = 2, n = Inf)
DEgs <- rownames(tt[tt$adj.P.Val < 0.01, , drop = FALSE])
length(DEgs)
[1] 1883
The 1883 gene sets that are differentially expressed at 1% FDR are shown in red in Figure 14.
Figure 14: Volcano and MA plots for GVSA
The next step was to complete a gene ontology analysis. This procedure was completed to see if any differentially expressed genes were associated with certain biological processes defined by the Gene Ontology system.
First, a parameter object was built that specified the gene universe of interest, the set of 1455 DE genes, the ontology, and other information. The ontology selected was “BP”, which matched genes to the Biological Processes associated with the GO Terms.
geneUniverse <- rownames(se.filt)
length(geneUniverse)
[1] 9963
library(GOstats)
Loading required package: Category
Loading required package: Matrix
Attaching package: 'Matrix'
The following object is masked from 'package:S4Vectors':
expand
Attaching package: 'GOstats'
The following object is masked from 'package:AnnotationDbi':
makeGOGraph
params <- new("GOHyperGParams", geneIds=DEgenes, universeGeneIds=geneUniverse,
annotation="org.Hs.eg.db", ontology="BP",
pvalueCutoff=0.01, testDirection="over")
Warning in makeValidParams(.Object): removing geneIds not in universeGeneIds
In the GO analysis, GO terms can sometime overlap. Therefore, a conditional test was chosen. An Odds Ratio of “Inf” indicated that all genes within a gene set were differentially expressed.
conditional(params) <- TRUE
hgOverCond <- hyperGTest(params)
hgOverCond
Gene to GO BP Conditional test for over-representation
7579 GO BP ids tested (47 have p < 0.01)
Selected gene set size: 1127
Gene universe size: 8939
Annotation package: org.Hs.eg
goresults <- summary(hgOverCond)
head(goresults)
GOBPID Pvalue OddsRatio ExpCount Count Size
1 GO:0002011 0.0003377497 4.519555 3.5301488 11 28
2 GO:0007189 0.0003426225 2.749349 8.9514487 20 71
3 GO:0010876 0.0005206152 1.889882 24.4588880 41 194
4 GO:0007187 0.0009287784 2.078975 16.0117463 29 127
5 GO:0006631 0.0010427968 1.858127 22.9459671 38 182
6 GO:0007172 0.0011310055 27.821906 0.6303837 4 5
Term
1 morphogenesis of an epithelial sheet
2 adenylate cyclase-activating G protein-coupled receptor signaling pathway
3 lipid localization
4 G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger
5 fatty acid metabolic process
6 signal complex assembly
Since signficiantly enriched gene sets formed by a few genes are not very reliable, a filter was added so those sets with less than or equal to 5 genes were removed. The table was than ordered by Odds Ratio.
goresults <- goresults[goresults$Size >= 5 & goresults$Count >= 5, ]
goresults <- goresults[order(goresults$OddsRatio, decreasing=TRUE), ]
goresults
GOBPID Pvalue OddsRatio ExpCount Count Size
23 GO:0045737 0.0046008469 6.958111 1.260767 5 10
26 GO:0043537 0.0048376104 5.221231 1.765074 6 14
11 GO:0006376 0.0019328356 4.647006 2.521535 8 20
37 GO:0032309 0.0072136074 4.640500 1.891151 6 15
38 GO:0048025 0.0072136074 4.640500 1.891151 6 15
1 GO:0002011 0.0003377497 4.519555 3.530149 11 28
22 GO:0019369 0.0044747007 4.432386 2.269381 7 18
17 GO:0003416 0.0027814383 4.288994 2.647612 8 21
45 GO:0010171 0.0086905926 3.749519 2.521535 7 20
9 GO:1990573 0.0012586714 3.656810 4.034456 11 32
27 GO:0008333 0.0057406264 3.301808 3.530149 9 28
28 GO:0043536 0.0057406264 3.301808 3.530149 9 28
29 GO:0031055 0.0058532649 3.031801 4.160532 10 33
24 GO:0034508 0.0046068579 2.951682 4.664839 11 37
47 GO:0006336 0.0091859439 2.788541 4.412686 10 35
2 GO:0007189 0.0003426225 2.749349 8.951449 20 71
10 GO:0006821 0.0018980524 2.530286 8.068912 17 64
43 GO:0098661 0.0074672695 2.277330 7.690681 15 61
21 GO:0035265 0.0042352048 2.178922 10.086139 19 80
4 GO:0007187 0.0009287784 2.078975 16.011746 29 127
15 GO:0042180 0.0024372379 2.076760 13.238058 24 105
19 GO:0019935 0.0032098814 1.960036 15.003132 26 119
3 GO:0010876 0.0005206152 1.889882 24.458888 41 194
5 GO:0006631 0.0010427968 1.858127 22.945967 38 182
16 GO:0030198 0.0026756023 1.757798 23.324197 37 185
20 GO:0051222 0.0032660275 1.669765 27.610807 42 219
44 GO:0051047 0.0083337105 1.573876 28.241190 41 224
46 GO:0042391 0.0088202396 1.549489 30.006265 43 238
30 GO:0006820 0.0059600165 1.487018 41.983555 58 333
31 GO:0051301 0.0062639237 1.462798 45.513704 62 361
25 GO:0040007 0.0047313203 1.400415 64.929522 85 515
Term
23 positive regulation of cyclin-dependent protein serine/threonine kinase activity
26 negative regulation of blood vessel endothelial cell migration
11 mRNA splice site selection
37 icosanoid secretion
38 negative regulation of mRNA splicing, via spliceosome
1 morphogenesis of an epithelial sheet
22 arachidonic acid metabolic process
17 endochondral bone growth
45 body morphogenesis
9 potassium ion import across plasma membrane
27 endosome to lysosome transport
28 positive regulation of blood vessel endothelial cell migration
29 chromatin remodeling at centromere
24 centromere complex assembly
47 DNA replication-independent nucleosome assembly
2 adenylate cyclase-activating G protein-coupled receptor signaling pathway
10 chloride transport
43 inorganic anion transmembrane transport
21 organ growth
4 G protein-coupled receptor signaling pathway, coupled to cyclic nucleotide second messenger
15 cellular ketone metabolic process
19 cyclic-nucleotide-mediated signaling
3 lipid localization
5 fatty acid metabolic process
16 extracellular matrix organization
20 positive regulation of protein transport
44 positive regulation of secretion
46 regulation of membrane potential
30 anion transport
31 cell division
25 growth
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] ca_ES.UTF-8/ca_ES.UTF-8/ca_ES.UTF-8/C/ca_ES.UTF-8/ca_ES.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] GO.db_3.7.0 GOstats_2.48.0
[3] Category_2.48.1 Matrix_1.2-17
[5] GSVA_1.30.0 GSVAdata_1.18.0
[7] hgu95a.db_3.2.3 org.Hs.eg.db_3.7.0
[9] GSEABase_1.44.0 graph_1.60.0
[11] sva_3.30.1 genefilter_1.64.0
[13] mgcv_1.8-28 nlme_3.1-140
[15] geneplotter_1.60.0 annotate_1.60.1
[17] XML_3.98-1.20 AnnotationDbi_1.44.0
[19] lattice_0.20-38 edgeR_3.24.3
[21] limma_3.38.3 SummarizedExperiment_1.12.0
[23] DelayedArray_0.8.0 BiocParallel_1.16.6
[25] matrixStats_0.54.0 Biobase_2.42.0
[27] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[29] IRanges_2.16.0 S4Vectors_0.20.1
[31] BiocGenerics_0.28.0 knitr_1.23
[33] BiocStyle_2.10.0
loaded via a namespace (and not attached):
[1] bit64_0.9-7 splines_3.5.3 shiny_1.3.2
[4] statmod_1.4.32 BiocManager_1.30.4 highr_0.8
[7] RBGL_1.58.2 blob_1.1.1 GenomeInfoDbData_1.2.0
[10] yaml_2.2.0 RSQLite_2.1.1 digest_0.6.19
[13] RColorBrewer_1.1-2 promises_1.0.1 XVector_0.22.0
[16] htmltools_0.3.6 httpuv_1.5.1 pkgconfig_2.0.2
[19] bookdown_0.11 zlibbioc_1.28.0 xtable_1.8-4
[22] later_0.8.0 survival_2.44-1.1 magrittr_1.5
[25] mime_0.7 memoise_1.1.0 evaluate_0.14
[28] tools_3.5.3 stringr_1.4.0 locfit_1.5-9.1
[31] compiler_3.5.3 grid_3.5.3 RCurl_1.95-4.12
[34] AnnotationForge_1.24.0 bitops_1.0-6 rmarkdown_1.13
[37] codetools_0.2-16 DBI_1.0.0 R6_2.4.0
[40] bit_1.1-14 shinythemes_1.1.2 Rgraphviz_2.26.0
[43] stringi_1.4.3 Rcpp_1.0.1 xfun_0.7